Online Job Scheduling with Redundancy and Opportunistic Checkpointing: A Speedup-Function-Based Analysis
Authors
Abstract
In a large-scale computing cluster, job completions can be substantially delayed by two sources of variability: variability in the job size and variability in the machine service capacity. To tackle this issue, existing works have proposed various scheduling algorithms that exploit redundancy, wherein a job runs on multiple servers until the first copy completes. In this paper, we explore the impact of variability in the machine service capacity and adopt a rigorous analytical approach to design scheduling algorithms using redundancy and checkpointing. We design several online scheduling algorithms which can dynamically vary the number of redundant copies for jobs. We also provide new theoretical performance bounds for these algorithms in terms of the overall job flowtime by introducing the notion of a speedup function, based on which a novel potential function can be defined to enable the corresponding competitive ratio analysis. In particular, by adopting the online primal-dual fitting approach, we prove that our SRPT+R Algorithm in a non-multitasking cluster is (1 + ε)-speed, O(1/ε)-competitive. We also show that our proposed Fair+R and LAPS+R(β) Algorithms for a multitasking cluster are (4 + ε)-speed, O(1/ε)-competitive and (2 + 2β + 2ε)-speed, O(1/(βε))-competitive, respectively. We demonstrate via extensive simulations that our proposed algorithms can significantly reduce job flowtime under both the non-multitasking and multitasking modes.
Similar resources
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...
Comparative Analysis of Fault Tolerance Techniques in Grid Environment
A grid, being a collection of heterogeneous resources connected through a network to execute complex jobs with high processing power requirements, is more vulnerable to faults. Faults may affect the performance and QoS of the grid. Faults are dealt with either by avoiding them or by recovering from them, through re-execution or by resuming execution from the point of failure using checkpoints. The var...
Analysis of checkpointing for schedulability of real-time systems
Checkpointing is a relatively cost-effective method for achieving fault tolerance in real-time systems. Since checkpointing schemes depend on time redundancy, they could affect the correctness of the system by causing deadlines to be missed. This paper provides exact schedulability tests for fault-tolerant task sets under a specified failure hypothesis and employing checkpointing to assist in fau...
Online Scheduling of Jobs for D-benevolent instances On Identical Machines
We consider online scheduling of jobs with specific release times on m identical machines. Each job has a weight and a size; the goal is to maximize the total weight of completed jobs. At its release time, a job must immediately be scheduled on a machine or it will be rejected. A job may also be preempted during execution; however, it will then be lost, and only the weight of completed jobs contri...
A Fault Tolerant Scheduling System Based on Checkpointing for Computational Grids
Job checkpointing is one of the most commonly utilized techniques for providing fault tolerance in computational grids. The efficiency of checkpointing depends on the choice of the checkpoint interval. An inappropriate checkpointing interval can delay job execution. In this paper, a fault-tolerant job scheduling system based on the checkpointing technique is presented and evaluated. When scheduling a jo...
Journal title:
- CoRR
Volume: abs/1707.01655, Issue: -
Pages: -
Publication date: 2017